
Conversation


@kdy1 kdy1 commented Nov 20, 2025

No description provided.

@kdy1 kdy1 added this to the Planned milestone Nov 20, 2025
@kdy1 kdy1 self-assigned this Nov 20, 2025

changeset-bot bot commented Nov 20, 2025

⚠️ No Changeset found

Latest commit: 7b7f3f2

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR


claude bot commented Nov 20, 2025

Code Review: PR #11318 - Performance optimization: Make all parsers work by byte instead of char

Overview

This PR refactors the input parsing infrastructure to operate on bytes instead of chars, which is a significant performance optimization. The changes are substantial but well-structured.

Positive Aspects ✅

  1. Performance-focused design: The byte-level operations should significantly improve parsing performance, especially for ASCII-heavy code (which is the common case).

  2. Careful UTF-8 handling: The bump() implementation in crates/swc_common/src/input.rs:133-156 correctly calculates UTF-8 character lengths:

    • 1 byte for ASCII leading bytes (< 0x80)
    • 2 bytes for leading bytes < 0xE0
    • 3 bytes for leading bytes < 0xF0
    • 4 bytes for leading bytes >= 0xF0
  3. Fallback mechanism: The code properly maintains cur_as_char() for cases where full Unicode character processing is needed (identifiers, strings, etc.), showing good design thinking.

  4. Test coverage: The existing unit tests have been updated to reflect the byte-based API changes.

Issues & Concerns 🔴

1. Potential UTF-8 boundary safety issue (High Priority)

Location: crates/swc_common/src/input.rs:144-152

The UTF-8 length calculation in bump() uses a simplified heuristic that may not be robust for all invalid UTF-8 sequences:

let len = if first_byte < 0x80 {
    1 // ASCII
} else if first_byte < 0xe0 {
    2 // 2-byte UTF-8
} else if first_byte < 0xf0 {
    3 // 3-byte UTF-8
} else {
    4 // 4-byte UTF-8
};

Issue: This doesn't validate that the subsequent bytes are valid UTF-8 continuation bytes (should be in range 0x80-0xBF). While the input is presumably valid UTF-8 (since it comes from a &str), slicing at incorrect boundaries could cause panics in debug builds or undefined behavior.

Recommendation: Since the input is &str, we know it's valid UTF-8. However, add a comment explaining this assumption and consider using str::chars().next() to get the actual character length for the fast path, or add debug assertions to validate continuation bytes.
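As a sketch of the debug-assertion route, mirroring the PR's length table and adding a debug-only continuation-byte check (the function names here are hypothetical, not the PR's actual API):

```rust
/// Byte length of the UTF-8 sequence starting with `first_byte`.
/// Assumes the input is already valid UTF-8, as it comes from a `&str`.
fn utf8_len(first_byte: u8) -> usize {
    if first_byte < 0x80 {
        1 // ASCII
    } else if first_byte < 0xe0 {
        2 // 2-byte UTF-8
    } else if first_byte < 0xf0 {
        3 // 3-byte UTF-8
    } else {
        4 // 4-byte UTF-8
    }
}

/// Debug-only validation that the trailing bytes are continuation bytes
/// (0b10xx_xxxx). The loop body compiles to nothing in release builds.
fn debug_validate_utf8(bytes: &[u8], len: usize) {
    for &b in &bytes[1..len] {
        debug_assert!(b & 0xC0 == 0x80, "expected continuation byte, got {b:#04x}");
    }
}

fn main() {
    for (s, expected) in [("a", 1), ("é", 2), ("世", 3), ("🌍", 4)] {
        let bytes = s.as_bytes();
        let len = utf8_len(bytes[0]);
        debug_validate_utf8(bytes, len);
        assert_eq!(len, expected);
        assert_eq!(len, s.len()); // matches what str itself reports
    }
}
```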

2. Unsafe code without complete documentation (Medium Priority)

Locations: Multiple instances throughout the codebase

Many unsafe blocks lack detailed safety comments explaining why the operation is safe. Examples:

  • crates/swc_common/src/input.rs:71, 79, 154
  • crates/swc_ecma_lexer/src/common/lexer/mod.rs: numerous unsafe slicing operations

Recommendation: Per the project's guidelines (#1 and #7 in CLAUDE.md), add comprehensive safety documentation. For example:

// Safety: first_byte comes from self.remaining.as_bytes() which is guaranteed to be
// valid UTF-8. The length calculation follows UTF-8 encoding rules (RFC 3629).
// get_unchecked is safe because we've checked that remaining is not empty.

3. Inconsistent ASCII validation (Medium Priority)

Location: crates/swc_common/src/input.rs:160-165

fn cur_as_ascii(&self) -> Option<u8> {
    let first_byte = *self.remaining.as_bytes().first()?;
    if first_byte <= 0x7f {
        Some(first_byte)
    } else {
        None
    }
}

Issue: The check first_byte <= 0x7f should be first_byte < 0x80 for consistency with the ASCII definition (0-127). While the two are functionally equivalent, < 0x80 is more idiomatic and matches the pattern used elsewhere (line 144).

Recommendation: Change to if first_byte < 0x80 for consistency.

4. Missing boundary checks (Medium Priority)

Location: crates/swc_common/src/input.rs:84-92

The bump_one() method guards the EOF case with debug_unreachable!, which only fires in debug builds, so misuse goes undetected in release builds:

pub fn bump_one(&mut self) {
    if !self.remaining.is_empty() {
        self.remaining = unsafe { self.remaining.get_unchecked(1..) };
        self.last_pos.0 += 1;
    } else {
        unsafe {
            debug_unreachable!("bump should not be called when cur() == None");
        }
    }
}

Issue: In release builds debug_unreachable! expands to an unreachable hint (unreachable_unchecked()), so reaching the else branch at EOF is undefined behavior rather than a harmless no-op.

Recommendation: Document the contract that bump_one() must never be called at EOF, and add a debug assertion that the current byte is ASCII, since advancing a single byte is only correct for single-byte characters.
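One possible shape, sketched on a simplified stand-in for the real StringInput, with the else branch replaced by explicit debug assertions:

```rust
struct Input<'a> {
    remaining: &'a str,
    last_pos: u32,
}

impl<'a> Input<'a> {
    /// Advance past one byte. Caller contract: the input is non-empty and
    /// the current byte is ASCII (so one byte is a full character).
    pub fn bump_one(&mut self) {
        debug_assert!(!self.remaining.is_empty(), "bump_one called at EOF");
        debug_assert!(
            self.remaining.as_bytes()[0] < 0x80,
            "bump_one called on a non-ASCII byte"
        );
        // An ASCII byte is always a char boundary, so safe slicing suffices
        // here; the real code uses `get_unchecked` for speed.
        self.remaining = &self.remaining[1..];
        self.last_pos += 1;
    }
}

fn main() {
    let mut i = Input { remaining: "ab", last_pos: 0 };
    i.bump_one();
    assert_eq!(i.remaining, "b");
    assert_eq!(i.last_pos, 1);
}
```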

5. Complex escape sequence handling (Low Priority)

Location: crates/swc_css_parser/src/lexer/mod.rs:709-787

The string reading logic has become more complex with the byte-based approach, particularly around escape sequences where we need to convert bytes back to chars:

let cur_byte = l.input.cur();
let cur_char = if let Some(b) = cur_byte {
    if is_non_ascii(b) {
        l.input.cur_as_char()
    } else {
        Some(b as char)
    }
} else {
    None
};

Recommendation: This is correct but verbose. Consider extracting this pattern into a helper method like cur_as_char_for_string() to reduce duplication and improve readability.
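A possible shape for that helper, sketched here on a plain &str rather than the real lexer input type (cur_as_char_for_string is a hypothetical name):

```rust
/// Decode the character at the front of `s`, taking the cheap path for ASCII.
/// Stand-in for a `cur_as_char_for_string()` method on the lexer input.
fn cur_as_char_for_string(s: &str) -> Option<char> {
    let b = *s.as_bytes().first()?;
    if b < 0x80 {
        Some(b as char) // ASCII fast path, no UTF-8 decoding
    } else {
        s.chars().next() // full multi-byte decode
    }
}

fn main() {
    assert_eq!(cur_as_char_for_string("abc"), Some('a'));
    assert_eq!(cur_as_char_for_string("世界"), Some('世'));
    assert_eq!(cur_as_char_for_string(""), None);
}
```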

Performance Considerations ⚡

  1. Excellent byte-search optimization: The use of byte_search! macros with lookup tables (lines 51-66 in lexer/mod.rs) for scanning is an excellent optimization pattern.

  2. Smart fast-path design: The code correctly identifies that ASCII is the common case and optimizes for it while falling back to slower Unicode handling when needed.

  3. Reduced allocations: Operating on byte slices rather than iterating over chars should reduce allocations and improve cache locality.

Testing Recommendations 🧪

Per CLAUDE.md guideline #5, please ensure:

  1. Add tests for edge cases:

    • Multi-byte UTF-8 sequences (2, 3, and 4-byte sequences)
    • Strings containing emoji and other non-BMP characters
    • Mixed ASCII and Unicode content
    • Strings with escape sequences followed by Unicode
  2. Add benchmarks comparing before/after performance, especially for:

    • Large ASCII files
    • Files with mixed ASCII/Unicode
    • Files with many escape sequences

Example test to add:

#[test]
fn test_multibyte_utf8() {
    with_test_sess("hello 世界 🌍", |mut i| {
        // Test bump() correctly handles multi-byte sequences
        assert_eq!(i.cur(), Some(b'h'));
        // ... test full sequence
    });
}

Documentation 📚

Per CLAUDE.md guideline #7, please add:

  1. Module-level documentation explaining the byte-based approach and when to use cur_as_char()
  2. Safety invariants for the Input trait
  3. Examples showing proper usage patterns

Minor Issues

  1. Formatting: Remember to run cargo fmt --all before committing (CLAUDE.md #8)

  2. Code comment clarity: Some comments could be more precise:

    • Line 13 in input.rs: "Remaining input as str" could be "Remaining unparsed input slice"

Summary

This is a well-thought-out performance optimization with good architectural decisions. The main concerns are:

  • Must fix: UTF-8 boundary handling needs more robust validation or better documentation of safety assumptions
  • Should fix: Unsafe code needs comprehensive safety documentation
  • Should fix: Add comprehensive tests for multi-byte UTF-8 sequences
  • Nice to have: Minor consistency improvements and helper methods for common patterns

The performance benefits should be significant, especially for typical JavaScript/TypeScript code which is mostly ASCII. Great work on maintaining the fast-path optimization pattern!

Overall Assessment: Approve with requested changes ✅ (pending fixes for UTF-8 safety and documentation)

🤖 Generated with Claude Code


github-actions bot commented Nov 20, 2025

Binary Sizes

File                    Size
swc.linux-x64-gnu.node  31M (31932360 bytes)

Commit: e086996


codspeed-hq bot commented Nov 20, 2025

CodSpeed Performance Report

Merging #11318 will degrade performances by 3.72%

Comparing kdy1/parser-by-byte (7b7f3f2) with main (be86fad)

Summary

⚡ 3 improvements
❌ 5 regressions
✅ 130 untouched

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark                                 BASE      HEAD      Change
css/lexer/bootstrap_5_1_3                 11 ms     11.4 ms   -3.72%
css/lexer/foundation_6_7_4                9 ms      9.3 ms    -3.43%
css/lexer/tailwind_3_1_1                  1.8 ms    1.8 ms    -3.28%
es/lexer/cal-com                          13 ms     13.5 ms   -3.21%
es/lexer/typescript                       106.9 ms  109.9 ms  -2.73%
html/lexer/css_2021_spec                  40 ms     38.9 ms   +2.93%
html/lexer/github_com_17_05_2022          14.4 ms   14.1 ms   +2.34%
html/lexer/stackoverflow_com_17_05_2022   14.5 ms   14.1 ms   +2.68%


claude bot commented Nov 20, 2025

Code Review for PR #11318: perf(parser): Make all parsers work by byte instead of char

I've reviewed this PR which converts the parsers from character-based to byte-based processing for improved performance. Here's my detailed feedback:

✅ Strengths

  1. Well-structured refactoring: The conversion from char to u8 is systematic and consistent across all parsers (CSS, ECMAScript, HTML, XML).

  2. Smart hybrid approach: The code correctly distinguishes between:

    • Fast path: ASCII bytes (< 0x80) processed directly as bytes
    • Slow path: Non-ASCII bytes where cur_as_char() is called to get full UTF-8 characters
  3. Correct UTF-8 byte length calculation: The bump() implementation in crates/swc_common/src/input.rs:143-152 correctly calculates UTF-8 character lengths:

    let len = if first_byte < 0x80 {
        1 // ASCII
    } else if first_byte < 0xe0 {
        2 // 2-byte UTF-8
    } else if first_byte < 0xf0 {
        3 // 3-byte UTF-8
    } else {
        4 // 4-byte UTF-8
    };

    This matches the UTF-8 specification correctly.

  4. Performance optimization: The new bump_bytes(n) method allows efficient advancement when the byte count is already known, avoiding redundant character boundary calculations.

  5. Proper bump_one() usage: The method is correctly restricted to single-byte ASCII operations (see crates/swc_ecma_parser/src/lexer/table.rs:346).

⚠️ Potential Issues & Concerns

1. Invalid UTF-8 handling (Medium severity)

The UTF-8 byte length calculation assumes valid UTF-8 input. For bytes that can never begin a valid sequence, it computes lengths anyway:

  • 0x80-0xBF: continuation bytes, but treated as starting a 2-byte sequence
  • 0xC0, 0xC1: invalid overlong encodings, but treated as 2-byte
  • 0xF5-0xFF: beyond the Unicode range, but treated as 4-byte

Recommendation: Since parsers typically receive valid UTF-8 from Rust strings, this is acceptable. However, consider adding a debug assertion or documentation comment noting this assumption.

Location: crates/swc_common/src/input.rs:143-152

2. Missing UTF-8 boundary validation in bump_bytes() (Low severity)

The bump_bytes() method has a debug assertion for length but doesn't verify UTF-8 character boundaries:

pub fn bump_bytes(&mut self, n: usize) {
    debug_assert!(n <= self.remaining.len());
    self.remaining = unsafe { self.remaining.get_unchecked(n..) };
    self.last_pos.0 += n as u32;
}

Recommendation: Add a debug assertion or documentation comment that callers must ensure n aligns with UTF-8 character boundaries.
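For example, str::is_char_boundary gives a cheap debug-only check (the Input struct below is a simplified stand-in for the real type):

```rust
struct Input<'a> {
    remaining: &'a str,
    last_pos: u32,
}

impl<'a> Input<'a> {
    pub fn bump_bytes(&mut self, n: usize) {
        debug_assert!(n <= self.remaining.len());
        // Debug-only: `n` must land on a UTF-8 character boundary, or
        // `remaining` would no longer be a valid `&str`.
        debug_assert!(self.remaining.is_char_boundary(n));
        // The real code uses `get_unchecked(n..)`; safe slicing shown here.
        self.remaining = &self.remaining[n..];
        self.last_pos += n as u32;
    }
}

fn main() {
    let mut i = Input { remaining: "世界!", last_pos: 0 };
    i.bump_bytes(3); // '世' is a 3-byte sequence
    assert_eq!(i.remaining, "界!");
    assert_eq!(i.last_pos, 3);
}
```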

Location: crates/swc_common/src/input.rs:76-80

3. Character conversion pattern in CSS/HTML parsers (Low severity)

Throughout the CSS and HTML lexers, there's a repetitive pattern:

let ch = if is_non_ascii(c) {
    l.input.cur_as_char()
} else {
    Some(c as char)
};

This is done before consuming the byte. While correct, it adds cognitive overhead.

Recommendation: Consider creating a helper method like cur_as_full_char() that encapsulates this pattern. Example:

impl Input {
    fn cur_as_full_char(&self) -> Option<char> {
        match self.cur()? {
            c if c < 0x80 => Some(c as char),
            _ => self.cur_as_char()
        }
    }
}

Locations: Multiple occurrences in crates/swc_css_parser/src/lexer/mod.rs and crates/swc_html_parser/src/lexer/mod.rs

4. Test coverage (Medium severity)

I notice the PR has a test failure ("Test with @swc/cli" failed). Additionally:

  • No new unit tests specifically for the byte-based Input trait changes
  • Existing tests in crates/swc_common/src/input.rs:392-425 were updated but not expanded

Recommendation:

  • Fix the failing test
  • Add unit tests covering:
    • Multi-byte UTF-8 characters (2, 3, and 4 bytes)
    • Mixed ASCII and non-ASCII text
    • bump_bytes() with various byte counts
    • Edge cases with special Unicode characters (emoji, etc.)

5. Performance verification (Low severity)

While the change should improve performance, I don't see benchmark results in the PR description.

Recommendation: Run the existing benchmarks and include results showing the performance improvement this change provides.

📋 Code Quality

  1. Comments: The code has good inline comments explaining the byte-based approach
  2. Follows CLAUDE.md: ✅ The code avoids unstable features, has comments in English, and appears performant
  3. Documentation: ⚠️ The trait methods could use more doc comments explaining the byte vs. char distinction

🔒 Security

No security concerns identified. The unsafe code blocks are properly justified and bounded by debug assertions.

📊 Summary

This is a solid performance optimization with generally good implementation. The main concerns are:

  1. Fix the failing test
  2. Add more comprehensive test coverage for UTF-8 handling
  3. Consider the recommendations for invalid UTF-8 and helper methods

The core algorithm is correct, and the refactoring is well-executed. Once the test failure is resolved and coverage is improved, this should be ready to merge.


Review generated by Claude Code 🤖

kdy1 and others added 8 commits November 25, 2025 15:05
Updated test assertions in `swc_common/src/input.rs` to use byte literals (b'x') instead of char literals ('x') to match the new byte-based Input trait signature.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Replaced all char literals with byte literals in the lexer to match the new byte-based Input trait.

Key changes:
- Changed char literals to byte literals: 'x' → b'x'
- Fixed debug assertions in lexer methods
- Updated comparisons in peek() and cur() checks
- For non-ASCII Unicode characters, used cur_as_char() to get full UTF-8 character
- Converted bytes to chars where needed for digit/identifier checks

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Remove unnecessary `as u8` cast in ECMAScript lexer (clippy error)
- Fix XML parser to use `cur_as_char()` and byte literals for char comparisons
- Fix HTML parser current_char usage (already in staged changes)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…sion

This fixes issues where multibyte UTF-8 characters (like '⬇' and '😀') were being corrupted during parsing after the char-to-byte conversion. The problem was that individual UTF-8 bytes were being cast directly to char using `as char`, which treats each byte as a separate Unicode code point.

**Changes made:**

1. **CSS Parser** (`swc_css_parser`):
   - Updated string tokenization to use `Input::cur_as_char()` for non-ASCII bytes
   - Fixed URL tokenization to properly decode multibyte UTF-8 sequences
   - Fixed bad URL remnant parsing

2. **HTML Parser** (`swc_html_parser`):
   - Removed truly unused functions (`is_upper_hex_digit_char`, `is_lower_hex_digit_char`, `is_ascii_hex_digit_char`, `is_ascii_lower_alpha_char`, `is_ascii_alpha_char`, `append_to_doctype_token`)
   - Kept `is_ascii_upper_alpha_char` which is still in use
   - HTML parser already had correct UTF-8 handling via `current_char` field

**Implementation:**
Before consuming bytes, we now check if it's non-ASCII (`>= 0x80`) and call `cur_as_char()` to get the full UTF-8 character before advancing the input stream. For ASCII bytes, we continue using the fast path of `as char`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@kdy1 kdy1 force-pushed the kdy1/parser-by-byte branch from 18bf7dd to f67c3c0 on November 25, 2025 at 06:05

claude bot commented Nov 25, 2025

Code Review: Performance optimization - byte-based parsing

Summary

This PR refactors the parser infrastructure to operate on bytes instead of chars, which should provide significant performance improvements. The changes are well-structured and follow a consistent pattern across all parsers (ECMAScript, CSS, HTML, XML).

✅ Strengths

1. Performance-focused design (aligns with CLAUDE.md #1)

  • Moving from char iteration to byte slicing is a solid optimization
  • The UTF-8 length calculation in input.rs:144-152 is efficient and correct
  • ASCII fast-path optimization in input.rs:194-198 is excellent:
    if c < 0x80 {
        self.input.bump_bytes(1);
    } else {
        unsafe { self.input.bump(); }
    }

2. Correct UTF-8 handling

  • The UTF-8 decoding logic correctly identifies character boundaries using the leading byte
  • Proper fallback to cur_as_char() for non-ASCII characters requiring full char processing
  • The is_non_ascii(c: u8) helper (c >= 0x80) is correct

3. API design improvements

  • Clear separation of concerns with cur() → byte, cur_as_char() → char
  • Added bump_bytes(n) for efficient multi-byte advancement
  • Well-documented with clear safety contracts

4. Good test coverage

  • Existing tests updated to work with bytes (b'o' instead of 'o')
  • Tests verify correct behavior after the refactor

⚠️ Issues & Concerns

1. Critical: Incomplete UTF-8 validation in bump() (input.rs:144-152)

The UTF-8 length calculation assumes well-formed UTF-8 but doesn't validate:

let len = if first_byte < 0x80 {
    1 // ASCII
} else if first_byte < 0xe0 {
    2 // 2-byte UTF-8
} else if first_byte < 0xf0 {
    3 // 3-byte UTF-8
} else {
    4 // 4-byte UTF-8
};

Problems:

  • Doesn't check for invalid UTF-8 sequences (e.g., 0xC0, 0xC1, 0xF5-0xFF)
  • Doesn't validate continuation bytes (0x80-0xBF)
  • Could read past valid UTF-8 boundaries on malformed input
  • Potential security issue if parsing untrusted input

Recommendation:
Either:

  1. Add validation (slower but safer):
    let len = if first_byte < 0x80 {
        1
    } else if first_byte < 0xC2 {
        // Invalid: overlong encoding or continuation byte
        return Err(...);
    } else if first_byte < 0xe0 {
        // Validate continuation byte
        if bytes.get(1).map_or(false, |&b| b & 0xC0 == 0x80) { 2 } else { return Err(...); }
    }
    // ... etc
  2. Or document that the input must be valid UTF-8 (since Rust strings are UTF-8, this is likely safe)

Since the input comes from Rust &str which is guaranteed valid UTF-8, this is probably safe. However, add a comment explaining this assumption:

// Safety: Input is &str which is guaranteed valid UTF-8 by Rust.
// We can safely decode the UTF-8 length from the first byte without validation.

2. Potential issue: bump_one() at input.rs:85-86

if !self.remaining.is_empty() {
    self.remaining = unsafe { self.remaining.get_unchecked(1..) };

This assumes the first byte is a complete character. This is incorrect for multi-byte UTF-8 sequences. If called on a multi-byte character, it will:

  • Break on the next character boundary check
  • Potentially cause issues with position tracking

Questions:

  • Is bump_one() only called for ASCII bytes?
  • If yes, add an assertion: debug_assert!(self.remaining.as_bytes()[0] < 0x80);
  • If no, this needs to be fixed to use the UTF-8 length calculation from bump()

3. CSS Parser: Mixed byte/char handling (swc_css_parser/src/lexer/mod.rs)

In multiple places, there's inconsistent handling:

let cur_char = if let Some(b) = cur_byte {
    if is_non_ascii(b) {
        l.input.cur_as_char()
    } else {
        Some(b as char)
    }
} else {
    None
};

This pattern is repeated many times (lines 735, 832, 850, etc.). Consider extracting to a helper:

fn byte_to_char(&self, b: u8) -> char {
    if b < 0x80 {
        b as char
    } else {
        self.input.cur_as_char().expect("non-ASCII byte should form valid UTF-8")
    }
}

4. Documentation gaps

  • The bump_bytes() function's safety requirements aren't clear enough. Add:

    /// # Panics (in debug mode)
    /// Panics if `n > self.remaining.len()`
    /// 
    /// # Safety
    /// For UTF-8 inputs, caller must ensure `n` aligns with character boundaries
    
  • The Input trait should document that implementors must handle UTF-8 correctly

5. Code style: Unnecessary casting (aligns with CLAUDE.md #1)

Multiple instances of c as char where c is a byte:

Ok(Token::Delim { value: b'#' as char })  // Line 324
Ok(Token::Delim { value: b'@' as char })  // Line 443

This is fine, but consider if there's a more efficient representation that doesn't require casting at runtime.

6. Potential performance regression in consume() (swc_css_parser/src/lexer/mod.rs:194-234)

The optimization added:

if c < 0x80 {
    self.input.bump_bytes(1);
} else {
    unsafe { self.input.bump(); }
}

This adds a branch for every character. While likely faster overall, consider:

  • Benchmarking to verify the performance improvement
  • The branch might be mispredicted on mixed ASCII/non-ASCII input

🔍 Security Considerations

  1. UTF-8 boundary safety: The bump() implementation assumes valid UTF-8. This is safe for Rust strings but could be problematic if the code is ever adapted for other inputs.

  2. Unsafe slicing: Multiple uses of get_unchecked(). These appear correct given the debug_assert! checks, but:

    • Ensure all code paths maintain the invariants
    • Consider fuzzing to catch boundary issues

📝 Minor Issues

  1. Line 78: debug_assert!(n <= self.remaining.len()); - Good! But also document this in the function doc comment.

  2. Inconsistent patterns: Some parsers check is_non_ascii(b) before calling cur_as_char(), others don't. Standardize this pattern.

  3. Test coverage: The existing tests were updated, but consider adding:

    • Tests with multi-byte UTF-8 characters at various positions
    • Tests with characters at the edge of different UTF-8 byte lengths
    • Boundary tests (e.g., consuming exactly to EOF with multi-byte chars)
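A sketch of such tests, using plain &str operations as a stand-in for the Input trait (the characters come from the commit messages above):

```rust
fn main() {
    // 1-byte, 3-byte, and 4-byte characters back to back.
    let src = "a⬇😀";
    let bytes = src.as_bytes();

    // cur() would return the raw first byte at each position.
    assert_eq!(bytes[0], b'a');
    assert!(bytes[1] >= 0x80); // leading byte of '⬇' is non-ASCII

    // Character boundaries are at the expected byte offsets.
    let boundaries: Vec<usize> = src.char_indices().map(|(i, _)| i).collect();
    assert_eq!(boundaries, vec![0, 1, 4]); // '⬇' is 3 bytes, '😀' is 4

    // Consuming exactly to EOF lands on a valid boundary.
    assert!(src.is_char_boundary(src.len()));
    assert_eq!(src.len(), 8);
}
```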

🎯 Recommendations

High Priority

  1. ✅ Add comment explaining UTF-8 safety assumption in bump()
  2. ⚠️ Fix or document bump_one() - clarify its ASCII-only use case
  3. 📊 Run benchmarks to quantify performance improvements
  4. 🧪 Add unit tests for multi-byte UTF-8 handling

Medium Priority

  1. Extract repeated byte-to-char conversion pattern into helper
  2. Improve documentation for bump_bytes() safety requirements
  3. Audit all get_unchecked() uses with careful review

Low Priority

  1. Consider if Token::Delim could store u8 instead of char for ASCII delimiters
  2. Add property-based tests for UTF-8 edge cases

✨ Conclusion

This is a high-quality performance optimization that follows good Rust practices. The core concept is sound, and the implementation is mostly correct. The main concerns are:

  1. Clarifying the UTF-8 safety assumptions
  2. Verifying bump_one() is correct for its use cases
  3. Measuring the actual performance impact

After addressing the UTF-8 documentation and bump_one() issues, this should be good to merge. Great work on maintaining test coverage and following the project's performance-first philosophy!


Reviewed following CLAUDE.md guidelines: performance-focused, English documentation, no unstable features, unit tests included.

@kdy1 kdy1 force-pushed the kdy1/parser-by-byte branch from f67c3c0 to 7b7f3f2 on November 25, 2025 at 06:56

claude bot commented Nov 25, 2025

Code Review - PR #11318: Make all parsers work by byte instead of char

I've reviewed this PR which refactors parsers to work with bytes instead of chars for performance improvements. Here's my feedback:

Strengths

  1. Performance-focused design: The change aligns with CLAUDE.md rule #1 ("Write performant code"). Working with bytes is more efficient than char iteration for ASCII-heavy parsing.

  2. Well-structured refactoring: The core change to Input trait properly distinguishes between byte operations (cur(), peek()) and character operations (cur_as_char()).

  3. UTF-8 handling in bump(): The implementation at swc_common/src/input.rs:143-152 correctly calculates UTF-8 character lengths using standard UTF-8 leading byte patterns.

  4. Comprehensive scope: Changes are consistently applied across all parsers (ECMAScript, CSS, HTML, XML).

⚠️ Issues & Concerns

1. CRITICAL: UTF-8 Validation Missing in bump()

Location: crates/swc_common/src/input.rs:143-152

The bump() function calculates UTF-8 character length based solely on the first byte without validating that the subsequent bytes are valid UTF-8 continuation bytes:

let len = if first_byte < 0x80 {
    1 // ASCII
} else if first_byte < 0xe0 {
    2 // 2-byte UTF-8
} else if first_byte < 0xf0 {
    3 // 3-byte UTF-8
} else {
    4 // 4-byte UTF-8
};

Problem: If the input contains invalid UTF-8 (e.g., malformed sequences), this could:

  • Skip incorrect number of bytes
  • Cause position tracking corruption
  • Potentially panic on subsequent get_unchecked() calls with out-of-bounds indices

Recommendation: Add validation or document that input is guaranteed to be valid UTF-8. Since StringInput uses &str which is guaranteed valid UTF-8 by Rust, this should be safe, but a comment explaining this invariant would improve clarity.

2. Simplified UTF-8 Length Calculation

The current implementation uses a series of if statements. On nightly it could be simplified to:

let len = std::str::utf8_char_width(first_byte);

but utf8_char_width is gated behind the unstable str_internals feature, so given the project's no-unstable-features rule, keep the current implementation and add a comment explaining the choice:

// Fast UTF-8 length calculation based on leading byte
// Safe because &str guarantees valid UTF-8

3. Mixed Byte/Char Handling in CSS Parser

Location: crates/swc_css_parser/src/lexer/mod.rs:726-736, 798-800

The code retrieves both byte and char representations:

let cur_byte = l.input.cur();
let cur_char = if let Some(b) = cur_byte {
    if is_non_ascii(b) {
        l.input.cur_as_char()
    } else {
        Some(b as char)
    }
} else {
    None
};

Issue: This pattern appears multiple times but cur_char is only used in the Some(_) wildcard case. This suggests the byte-to-char conversion logic could be cleaner.

Recommendation: Consider refactoring to only call cur_as_char() when needed, or add a helper function for this common pattern.

4. Unsafe Code Usage

Multiple uses of unsafe { get_unchecked() } throughout:

  • input.rs:71, 79, 86, 154, 201, 244, 265
  • All uses appear justified with proper bounds checking

Recommendation: All unsafe blocks have adequate debug_assert! guards, but consider adding safety comments explaining the invariants, especially for bump_bytes() at line 77-80.

5. Test Coverage

Observations:

  • Unit tests in input.rs updated to use bytes (b'o', b'/', etc.)
  • Tests verify basic functionality (slice, reset_to, bump)

Gaps:

  • No explicit tests for UTF-8 multi-byte character handling
  • No tests for edge cases with non-ASCII characters
  • No performance benchmarks included (though CI shows pending benchmark jobs)

Recommendation: Add test cases covering:

#[test]
fn src_input_multibyte_utf8() {
    with_test_sess("hello世界", |mut i| {
        // Test bump() correctly handles multi-byte UTF-8
        // Test cur() returns first byte of multi-byte char
        // Test cur_as_char() returns full character
    });
}

6. API Consistency

The trait now has:

  • cur() → Option<u8> (byte)
  • cur_as_ascii() → Option<u8> (byte, only if ASCII)
  • cur_as_char() → Option<char> (character)

This is good separation, but documentation could be clearer about when to use each method.

📝 Minor Issues

  1. Line 144-152: Consider extracting UTF-8 length calculation to a separate const fn for reusability and testing.

  2. Documentation: The trait methods have good docs, but the bump() implementation could use inline comments explaining the UTF-8 encoding scheme.

  3. Performance: The refactoring should show measurable improvements, but I see benchmark jobs are still pending in CI.
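The const fn extraction suggested in item 1 might look like this (utf8_byte_len is a hypothetical name; the body mirrors the PR's logic):

```rust
/// Byte length of a UTF-8 sequence, derived from its leading byte.
/// Valid only for leading bytes of well-formed UTF-8, which `&str` guarantees.
const fn utf8_byte_len(first_byte: u8) -> usize {
    if first_byte < 0x80 {
        1
    } else if first_byte < 0xe0 {
        2
    } else if first_byte < 0xf0 {
        3
    } else {
        4
    }
}

// Being `const` means the table can also be checked at compile time.
const _: () = assert!(utf8_byte_len(b'a') == 1);

fn main() {
    assert_eq!(utf8_byte_len("é".as_bytes()[0]), 2);
    assert_eq!(utf8_byte_len("世".as_bytes()[0]), 3);
    assert_eq!(utf8_byte_len("🌍".as_bytes()[0]), 4);
}
```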

🔒 Security

No security concerns identified. The refactoring maintains safety through:

  • Rust's &str UTF-8 guarantees
  • Proper bounds checking with debug assertions
  • No new attack surface introduced

Adherence to CLAUDE.md

  • ✅ Rule 1: Performance-focused (main goal)
  • ✅ Rule 2: English comments/docs
  • ✅ Rule 3: No unstable features used
  • ✅ Rule 5: Unit tests present
  • ⚠️ Rule 6: Tests updated but not removed (good)
  • ⚠️ Rule 7: Could use more inline documentation
  • ❓ Rule 8: cargo fmt likely run (formatting looks consistent)

📊 Overall Assessment

This is a solid refactoring with clear performance benefits. The core idea is sound and implementation is mostly correct. Main concerns:

  1. Add validation explanation/comments for UTF-8 handling in bump()
  2. Add tests for multi-byte UTF-8 characters
  3. Consider extracting common byte-to-char conversion pattern in CSS parser

Recommendation: Approve with minor changes

The changes are safe due to Rust's UTF-8 guarantees, but adding the suggested documentation and tests would significantly improve confidence and maintainability.


Review generated by Claude Code
